Results


Mood Data Differences

Explanation of the plots

The graph in the centre depicts the difference between the human and Spotify ratings for energy and valence, where each human rating is the mean of all viable survey responses in the respective category. More precisely, the human ratings were subtracted from the Spotify ratings. A positive difference in energy together with a negative difference in valence, as with Chinese contemporary music for example, therefore means that the survey respondents rated these songs as less energetic but more emotionally positive than Spotify did. Both human and Spotify ratings lie on a scale between 0 and 1, so the differences theoretically lie within \([-1, 1]\), although in practice no absolute values larger than \(0.5\) were observed. The scatter plot shows the two-dimensional difference for each of the eight song categories. In addition, an overall difference was defined as the distance between each marker and the plot origin, i.e.

\[ \texttt{Overall Difference} \;\;= \;\; \sqrt{\texttt{Valence Difference}^2 + \texttt{Energy Difference}^2} \]

This value quantifies the overall inaccuracy of the Spotify energy and valence ratings relative to the human ratings. The bar chart, which can be selected with the button in the top left, plots these values for each of the eight categories.
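As a minimal sketch of the calculation described above, the snippet below computes the two signed differences (Spotify minus human) and the overall difference for one track. It uses the values listed for track 10 in the table of survey clips; the dashboard itself plots category means, and the function and variable names here are illustrative assumptions, not the code used in the project.

```python
import math


def rating_difference(spotify, human):
    """Return (energy_diff, valence_diff, overall_diff), each Spotify minus human."""
    energy_diff = spotify["energy"] - human["energy"]
    valence_diff = spotify["valence"] - human["valence"]
    # Euclidean distance of the (valence, energy) difference from the origin.
    overall_diff = math.hypot(valence_diff, energy_diff)
    return energy_diff, valence_diff, overall_diff


# Values for track 10, taken from the table of survey clips.
spotify = {"energy": 0.620, "valence": 0.676}
human = {"energy": 0.681, "valence": 0.621}

energy_diff, valence_diff, overall_diff = rating_difference(spotify, human)
print(round(energy_diff, 3), round(valence_diff, 3), round(overall_diff, 3))
# prints: -0.061 0.055 0.082
```

Here the negative energy difference means respondents rated the track as more energetic than Spotify did, and the positive valence difference means they rated it as less emotionally positive.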

The interactive maps on the right depict the overall difference between human and Spotify ratings by the respondents’ country of birth. There is one plot for each of the four music cultures studied, which can be toggled using the radio buttons in the top right. The higher the mean rating difference across all respondents from a country, the more saturated the shading of that country in the plot. Note that the reliability of these values depends heavily on the number of responses received from that country. As the scope of this project did not allow for comprehensive data collection on a global scale, there are only a few responses from each country, which does not allow any significant conclusions to be drawn. The exact number of respondents is shown when hovering over a country in the plot.
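The per-country aggregation behind the maps can be sketched as follows. The response values are made up purely for illustration, and the function name and data layout are assumptions rather than the project's actual code.

```python
from collections import defaultdict
from statistics import mean


def summarise_by_country(responses):
    """Map each country to (mean overall difference, number of respondents)."""
    grouped = defaultdict(list)
    for country, overall_diff in responses:
        grouped[country].append(overall_diff)
    return {country: (mean(values), len(values)) for country, values in grouped.items()}


# Hypothetical (country of birth, overall difference) pairs, one per respondent.
responses = [("Slovenia", 0.20), ("Slovenia", 0.30), ("India", 0.10)]
summary = summarise_by_country(responses)
print(summary)
```

Keeping the respondent count next to each mean makes it easy to see which country values rest on only one or two responses.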


Regional Response Differences

Test results

The raw distances between the participants’ and Spotify’s ratings of energy and valence, and the sum of those distances (representing total accuracy), were scaled by converting them to z-scores. A z-score expresses each observation as its distance from the variable’s mean in units of standard deviation. The following analyses were conducted on the scaled variables.
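A minimal sketch of this scaling step is shown below. The input distances are made up, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the source does not say which variant was used.

```python
import statistics


def z_scores(values):
    """Scale values so that they have mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1); an assumption
    return [(value - mean) / sd for value in values]


# Hypothetical raw distances between human and Spotify ratings.
distances = [0.10, 0.25, 0.40, 0.05, 0.20]
scaled = z_scores(distances)
print([round(z, 2) for z in scaled])
```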

Overall accuracy

The main hypothesis, that Spotify’s ratings would be less accurate for non-Western than for Western songs, was not supported. In fact, the opposite effect was found: a Welch independent-samples t-test showed that the ratings were more accurate for non-Western music (t(1543.1) = -2.64, p = .008).
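The non-integer degrees of freedom reported here are characteristic of Welch's test, which does not assume equal variances in the two groups. The sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The two samples are made up, and the order of subtraction (first group minus second) is an assumption.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on samples a and b."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Squared standard error of each group mean.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation; the result is usually not an integer.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical z-scored distances for two groups of songs.
non_western = [-0.4, 0.1, -0.2, 0.3, -0.5, 0.0]
western = [0.2, 0.6, -0.1, 0.9, 0.4]
t, df = welch_t(non_western, western)
print(round(t, 2), round(df, 1))
```

A negative t in this sketch means the first group has the smaller mean distance, i.e. the more accurate ratings.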

A closer investigation of the differences in accuracy, using a two-way ANOVA testing for differences across regions, traditionality and their interaction, showed a main effect of region (F(3) = 6.79, p < .001) and a main effect of traditionality (F(1) = 12.55, p < .001), but no interaction between the two (F(3) = 1.20, p = .310).

A follow-up series of pairwise t-tests shed light on the differences in accuracy across regions. Significant differences were found for Western vs Arab, Western vs Indian, Chinese vs Arab and Chinese vs Indian songs. The significance level was adjusted with the Bonferroni correction to 0.83%.
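The adjusted level of 0.83% is consistent with dividing a conventional 5% level by the six possible pairwise comparisons among four regions. The starting level of 0.05 is an assumption inferred from the reported figure; the sketch below only reproduces that arithmetic.

```python
from itertools import combinations

regions = ["Western", "Chinese", "Arab", "Indian"]
pairs = list(combinations(regions, 2))  # every unordered pair of regions

alpha = 0.05  # assumed unadjusted significance level
adjusted_alpha = alpha / len(pairs)  # Bonferroni correction

print(len(pairs), f"{adjusted_alpha:.2%}")
# prints: 6 0.83%
```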

A follow-up Welch t-test showed that the accuracy was higher for contemporary than for traditional songs (t(2168.6) = -3.23, p = .001).

To gain a more detailed understanding of the effects, the study assessed the accuracy in valence and energy ratings separately.

Valence accuracy

A Welch independent-samples t-test found no significant difference in the accuracy of valence ratings between non-Western and Western songs (t(1660.2) = -1.88, p = .060).

A two-way ANOVA testing for differences in valence ratings across regions, traditionality and their interaction found a main effect of region (F(3) = 4.07, p = .007) but no main effect of traditionality (F(1) = 0.90, p = .343). A significant interaction between region and traditionality was observed (F(3) = 2.72, p = .043).

Follow-up pairwise t-tests were employed to test the differences in the accuracy of valence ratings across regions. At the Bonferroni-adjusted significance level of 0.83%, no significant differences were found across regions.

Energy accuracy

Energy ratings were more accurate for non-Western than for Western songs (t(1270.2) = -5.12, p < .001), as shown by a Welch independent-samples t-test.

A two-way ANOVA testing for differences in energy ratings across regions, traditionality and their interaction found a main effect of region (F(3) = 25.06, p < .001) and a main effect of traditionality (F(1) = 342.14, p < .001). Furthermore, a significant interaction between region and traditionality was present (F(3) = 28.23, p < .001).

Follow-up pairwise t-tests were employed to test the differences in the accuracy of energy ratings across regions. At the Bonferroni-adjusted significance level of 0.83%, significant differences were found for Western vs Arab, Western vs Indian, Chinese vs Arab and Chinese vs Indian songs.

A follow-up Welch t-test showed that the accuracy of energy ratings was higher for contemporary than for traditional songs (t(1747.6) = -16.41, p < .001).

Background


Background Information

Methods

We designed and distributed our survey using Qualtrics. Participants were first asked to give their consent, after which they were asked for their country of birth. Next, definitions of valence and energy were supplied to give each participant a better idea of what to keep in mind when rating the songs that followed. Each participant listened to all song clips in a randomized order to guard against order effects that could have influenced our results. The survey was distributed through our personal networks of family, friends, and acquaintances. After collecting responses for about four weeks, we ended up with a total of 130 respondents. The countries most represented in our final sample were Slovenia, Slovakia, India, the Netherlands, Romania, and Germany. The collected data was then exported for further analysis.

Other interesting findings

  • No significant main or interaction effect of Western/non-Western music on valence ratings (ANOVA)
  • No significant difference between Western and non-Western music in the difference between human and Spotify ratings (t-test)
  • No difference in valence ratings between contemporary and traditional music (t-test)
  • Significantly higher accuracy in energy ratings for contemporary than for traditional music (t-test)

Bibliography

Discussion

Our research investigated the accuracy of the Spotify API in rating the energy and valence of Western and non-Western songs. Our findings suggest that Spotify’s rating accuracy is not consistent across regions, and especially not between classical and modern music. Overall, genres outside the Western mainstream showed lower accuracy, possibly due to Spotify’s focus on Western music and user demographics. Supporting this outcome is the fact that the two least accurately rated styles turned out to be Western classical and Chinese traditional music.

Spotify is a Western company with a predominantly Western user base, so from a marketing perspective it makes sense to train the API to be as precise as possible on current popular music. However, this means that we need to be careful when interpreting the values the Spotify API outputs: its main goal is not to create a database of accurate song data that is viable for research, but to predict its users’ musical taste in order to keep them satisfied with the product they are paying for. From an ethical standpoint, the Spotify API should strive for accuracy across all cultures and genres to avoid perpetuating cultural biases and to promote equal representation in music recommendation systems.

Further research is needed to investigate the specific factors that lead Spotify to rate certain genres so differently from humans. This research, however, would require access to Spotify’s internal data to understand the details of each factor, as well as the inner workings of the scales on which it rates the songs.

Limitations

Limited dataset

The aim of our study was to find out whether there is a potential bias in how the Spotify API rates the energy and valence of Western and non-Western songs. By creating a survey in which participants were asked to rate 23 songs from different cultures, we were able to compare their ratings with Spotify’s. We were able to collect 102 entries from participants residing in 26 different countries. Our findings on geographical response data are therefore based on only a few people from any given country and should not be considered representative of that country as a whole. A larger sample size would address this issue.

Unknown Spotify rating method

Another limitation we faced was correctly mapping our 7-point Likert scale onto Spotify’s ratings of valence and energy, which range from 0 to 1. Although the ratings of each song are publicly accessible, the method Spotify uses to calculate them remains undisclosed. The most challenging part was therefore a measurement assumption: for this research, we assumed that the Spotify scale is equally spaced, i.e. that equal distances between values represent equal differences. This might not be the case, potentially causing our results to be inaccurate.
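One way to put a 7-point Likert response on a 0-to-1 scale under an equal-spacing assumption is a linear rescaling. The sketch below shows such a mapping; the exact transformation used in the project is not stated in the text, so the formula and the function name are assumptions for illustration only.

```python
def likert_to_unit(score, low=1, high=7):
    """Linearly map a Likert score in [low, high] onto the interval [0, 1]."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the range {low}..{high}")
    return (score - low) / (high - low)


# The lowest, middle and highest responses on a 7-point scale.
print(likert_to_unit(1), likert_to_unit(4), likert_to_unit(7))
# prints: 0.0 0.5 1.0
```

A mapping of this kind treats each step on the Likert scale as an equal step of 1/6 on the unit scale, which is exactly the kind of equal-spacing assumption the paragraph above flags as uncertain.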

Length of Survey

In an ideal setting, we would have played the full songs to the participants rather than the 15-second snippets we had to resort to in order to keep the questionnaire under 15 minutes. The limitation here is that Spotify rates both valence and energy as an average calculated over a whole song, while our participants had only 15 pre-selected seconds on which to base each rating. Although we tried our best to select representative clips, this might still cause the accuracy of the measurements to deviate.


Spotify Mood Data

Human Mood Data

Song Data


Clips used in the survey

| ID | Track title | Artists | Culture | Modernity | Spotify Energy | Spotify Valence | Human Energy | Human Valence |
|---|---|---|---|---|---|---|---|---|
| 1 | Ghoomar | Shreya Ghoshal, Swaroop Khan | Indian | Traditional | 0.805 | 0.742 | 0.507 | 0.549 |
| 2 | Taal Se Taal | Alka Yagnik, Udit Narayan | Indian | Traditional | 0.532 | 0.614 | 0.820 | 0.703 |
| 3 | Ajab Si | KK | Indian | Contemporary | 0.438 | 0.580 | 0.402 | 0.546 |
| 4 | Chammak Challo | Akon, Hamsika Iyer | Indian | Contemporary | 0.879 | 0.936 | 0.700 | 0.725 |
| 5 | Dil Diyan Gallan | Atif Aslam | Indian | Contemporary | 0.640 | 0.485 | 0.604 | 0.602 |
| 6 | 金蛇狂舞 | 新时代乐队 | Chinese | Traditional | 0.294 | 0.966 | 0.783 | 0.753 |
| 7 | Liu Yang River | Hong Ting | Chinese | Traditional | 0.130 | 0.214 | 0.332 | 0.624 |
| 8 | 二泉映月 | 黃晨達 | Chinese | Traditional | 0.044 | 0.205 | 0.237 | 0.396 |
| 9 | | 郑润泽 | Chinese | Contemporary | 0.480 | 0.234 | 0.266 | 0.339 |
| 10 | 罗刹海市 | 刀郎 | Chinese | Contemporary | 0.620 | 0.676 | 0.681 | 0.621 |
| 11 | Fouq annakhl | Sabah Fakhri | Arab | Traditional | 0.797 | 0.642 | 0.569 | 0.387 |
| 12 | Fe Nour Mahyak | Umm Kulthum | Arab | Traditional | 0.139 | 0.588 | 0.290 | 0.274 |
| 13 | Taqaseem bayati | Riad Al Sonbati | Arab | Traditional | 0.375 | 0.318 | 0.465 | 0.465 |
| 14 | Ah W Noss | Nancy Ajram | Arab | Contemporary | 0.933 | 0.967 | 0.750 | 0.699 |
| 15 | Shokran | Assala Nasri | Arab | Contemporary | 0.735 | 0.748 | 0.635 | 0.633 |
| 16 | Orchestral Suite No. 2 in B Minor, BWV 1067: VII. Badinerie | Johann Sebastian Bach, Karl Kaiser, Kolner Kammerorchester, Helmut Müller-Brühl | Western | Traditional | 0.269 | 0.964 | 0.767 | 0.772 |
| 17 | Symphony No.9 In E Minor, Op.95 “Aus der neuen Welt”: 4. Allegro Con Fuoco - Excerpt | Antonín Dvořák, Berliner Philharmoniker, Rafael Kubelík | Western | Traditional | 0.231 | 0.533 | 0.811 | 0.620 |
| 18 | Clair de Lune, L. 32 | Claude Debussy, Martin Jones | Western | Traditional | 0.005 | 0.040 | 0.184 | 0.405 |
| 19 | I Wanna Be Yours | Arctic Monkeys | Western | Contemporary | 0.417 | 0.479 | 0.394 | 0.422 |
| 20 | Unholy (feat. Kim Petras) | Sam Smith, Kim Petras | Western | Contemporary | 0.463 | 0.206 | 0.730 | 0.474 |
| 21 | Boy’s a liar Pt. 2 | PinkPantheress, Ice Spice | Western | Contemporary | 0.809 | 0.857 | 0.710 | 0.613 |
| 22 | I’m Good (Blue) | David Guetta, Bebe Rexha | Western | Contemporary | 0.965 | 0.304 | 0.882 | 0.826 |
| 23 | TQG | KAROL G, Shakira | Western | Contemporary | 0.630 | 0.607 | 0.619 | 0.539 |